ABSTRACT

Collocations are notoriously difficult for non-native speakers to
translate, primarily because they are opaque and cannot be translated
on a word-by-word basis. We describe a program named Champollion
which, given a pair of parallel corpora in two different languages,
automatically produces translations of an input list of collocations.
Our goal is to provide a tool to compile bilingual lexical information
above the word level in multiple languages and domains. The algorithm
we use is based on statistical methods and produces p-word
translations of n-word collocations in which n and p need not be
the same; the collocations can be either flexible or fixed compounds.
For example, Champollion translates “to make a decision,”
“employment equity,” and “stock market” into “prendre une
décision,” “équité en matière d’emploi,” and “bourse,” respectively.
Testing and evaluation of Champollion on one year’s worth of the
Hansards corpus yielded 300 collocations and their translations,
evaluated at 77% accuracy. In this paper, we describe the statistical
measures used, the algorithm, and the implementation of Champollion,
presenting our results and evaluation.

1.  Introduction 

Hieroglyphics remained undeciphered for centuries until the discovery
of the Rosetta Stone at the end of the 18th century in Rosetta, Egypt.
The Rosetta Stone is a tablet of black basalt containing parallel
inscriptions in three different writings: one in Greek, and the two
others in two different forms of ancient Egyptian writing (demotic
and hieroglyphics). Jean-François Champollion, a linguist and
Egyptologist, made the assumption that these inscriptions were
parallel and managed, after several years of research, to decipher the
hieroglyphic inscriptions. He used his work on the Rosetta Stone as
a basis from which to produce the first comprehensive hieroglyphic
dictionary.

In this paper, we describe a modern version of a similar approach:
given a large corpus in two languages, our program, Champollion,
produces translations of common word pairs and phrases which can
form the basis for a bilingual lexicon. Our focus is on the use
of statistical methods for the translation of multi-word expressions,
such as collocations, which cannot consistently be translated on a
word-by-word basis. Bilingual collocation dictionaries are currently
unavailable even for languages such as French and English, despite
the fact that collocations have been recognized as one of the main
obstacles to second language acquisition [15].

We developed a program, Champollion, which translates collocations
using an aligned parallel bilingual corpus, or database corpus, as a
reference. This corpus represents Champollion’s knowledge of both
languages. For a given source language collocation, Champollion
uses statistical methods to incrementally construct the collocation
translation, adding one word at a time. Champollion first identifies
individual words in the target language which are highly correlated
with the source collocation. Then, it identifies any pairs in this set
of individual words which are highly correlated with the source
collocation. Similarly, triplets are produced by adding a word to a pair
if it is highly correlated, and so forth until no higher combination
of words is found. Champollion selects as the target collocation
the group of words with highest cardinality and correlation factor.
Finally, it orders the words of the target collocation by examining
samples in the corpus. If word order is variable in the target
collocation, Champollion labels it flexible (as in “to take steps to,”
which can appear as “took steps to,” “steps were taken to,” etc.).

To evaluate Champollion, we used a collocation compiler,
Xtract [12], to automatically produce several lists of source
(English) collocations. These source collocations contain both flexible
word pairs, which can be separated by an arbitrary number of words,
and fixed constituents, such as compound noun phrases. We then
ran Champollion on separate corpora, each consisting of one year’s
worth of data extracted from the Hansards corpus. We asked several
speakers who are conversant in both French and English to judge the
results. Accuracy was rated at 77% for one test set and 61% for the
second set. In our discussion of results, we show how problems for
the second test set can be alleviated.

In the following sections, we first describe the algorithm and statistics
used in Champollion; we then present our evaluation and results; and
finally, we discuss related work and present our conclusions.

2.  Champollion:  Algorithm  and  Statistics 

Champollion’s algorithm relies on the following two assumptions:

•  If  two  groups  of  words  are  translations  of  one  another,  then 
the  number  of  paired  sentences  in  which  they  appear  in  the 
database  corpus  is  greater  than  expected  by  chance.  In  other 
words,  the  two  groups  of  words  are  correlated. 

•  If  a  set  of  words  is  correlated  with  the  source  collocation,  its 
subsets  will  also  be  correlated  with  the  source  collocation. 


The  first  assumption  allows  us  to  use  a  correlation  measure  as  a  basis 
for  producing  translations,  and  the  second  assumption  allows  us  to 
reduce  our  search  from  exponential  time  to  constant  time  (on  the 
size  of  the  corpus)  using  an  iterative  algorithm.  In  this  section,  we 
first  describe  prerequisites  necessary  before  running  Champollion, 
we  then  describe  the  correlation  statistics,  and  finally  we  describe 
the  algorithm  and  its  implementation. 

2.1.  Preprocessing.

Two steps must be carried out before running Champollion: the
database corpus must be aligned sentence-wise, and a list of
collocations to be translated must be provided in the source
language.

Aligning the database corpus. Champollion requires that the database
corpus be aligned so that sentences that are translations of one
another are co-indexed. Most bilingual corpora are given as two
separate (sets of) files. The problem of identifying which sentences
in one language correspond to which sentences in the other is
complicated by the fact that sentence order may be reversed or several
sentences may translate a single sentence. Sentence alignment
programs (e.g., [10], [2], [11], [1], [4]) insert identifiers before each
sentence in the source and the target text so that translations are
given the same identifier. For Champollion, we used corpora that
had been aligned by Church’s sentence alignment program [10] as
our input data.¹
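
To illustrate what co-indexing provides, the following minimal Python sketch stores an aligned corpus as sentence pairs sharing an identifier and builds, for each language, an inverted index from words to sentence identifiers. The names `AlignedPair` and `build_index`, the whitespace tokenization, and the sample sentences are assumptions made for illustration; they are not taken from Champollion's implementation.

```python
from collections import defaultdict
from typing import NamedTuple

class AlignedPair(NamedTuple):
    sent_id: int   # identifier shared by both sides of the pair
    english: str
    french: str

def build_index(pairs, side):
    """Map each lowercased word on one side to the set of sentence ids containing it."""
    index = defaultdict(set)
    for pair in pairs:
        for word in getattr(pair, side).lower().split():
            index[word].add(pair.sent_id)
    return index

# Hypothetical three-sentence aligned corpus.
corpus = [
    AlignedPair(0, "the stock market fell", "la bourse a chuté"),
    AlignedPair(1, "the market is open", "le marché est ouvert"),
    AlignedPair(2, "the stock market rose", "la bourse a monté"),
]
en_index = build_index(corpus, "english")
fr_index = build_index(corpus, "french")

# Sentence pairs whose English side contains "stock" and "market"
# and whose French side contains "bourse".
both = en_index["stock"] & en_index["market"] & fr_index["bourse"]
print(sorted(both))  # [0, 2]
```

With such an index, the sentence counts needed by a correlation measure reduce to set intersections over sentence identifiers.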

Providing Champollion with a list of source collocations. A list
of source collocations can be compiled manually by experts, but
it can also be compiled automatically by tools such as Xtract [17],
[12]. Xtract produces a wide range of collocations, including flexible
collocations of the type “to make a decision,” in which the words
can be inflected, the word order might change, and the number of
additional words can vary. Xtract also produces compounds, such
as “The Dow Jones average of 30 industrial stock,” which are rigid
collocations. We used Xtract to produce a list of input collocations
for Champollion.

2.2.  Statistics  used:  The  Dice  coefficient. 

There are several ways to measure the correlation of two events.
In information retrieval, measures such as the cosine measure, the
Dice coefficient, and the Jaccard coefficient have been used [21], [5],
while in computational linguistics the mutual information of two events
is most widely used (e.g., [18], [19]). For this research we use the
Dice coefficient because it offers several advantages in our context.

Let x and y be two basic events in our probability space, representing
the occurrence of a given word (or group of words) in the English
and French corpora respectively. Let f(x) represent the frequency
of occurrence of event x, i.e., the number of sentences containing x.
Then p(x), the probability of event x, can be estimated by f(x) divided
by the total number of sentences. Similarly, the joint probability of
x and y, p(x ∧ y), is the number of sentences containing x in their
English version and y in their French version, f(x ∧ y), divided by
the total number of sentences. We can now define the Dice coefficient
and the mutual information of x and y as:

$$Dice(x, y) = A \times \frac{f(x \wedge y)}{f(x) + f(y)}$$

$$MU(x, y) = \log\left(\frac{f(x \wedge y)}{f(x) \times f(y)}\right) + B$$

where A and B are constants related to the size of the corpus.
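
As a concrete illustration, both measures can be computed directly from sentence counts. The Python sketch below assumes the usual normalizations, A = 2 for the Dice coefficient and a base-2 logarithm of p(x ∧ y) / (p(x) p(y)) for mutual information; the function names and the counts are hypothetical.

```python
import math

def dice(f_x, f_y, f_xy):
    """Dice coefficient from sentence counts, taking A = 2."""
    return 2.0 * f_xy / (f_x + f_y)

def mutual_information(f_x, f_y, f_xy, n_sentences):
    """log2(p(x ^ y) / (p(x) p(y))), probabilities estimated as counts / n_sentences.

    Requires f_xy > 0, since the logarithm of zero is undefined.
    """
    return math.log2(f_xy * n_sentences / (f_x * f_y))

# Hypothetical counts: x in 40 sentences, y in 60, both in 30, out of 1000 pairs.
print(dice(40, 60, 30))                                  # 0.6
print(round(mutual_information(40, 60, 30, 1000), 3))    # 3.644
```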

¹ We are thankful to Ken Church and the Bell Laboratories for providing
us with a prealigned Hansards corpus.

We found the Dice coefficient to be better suited to our problem than
the more widely used mutual information. We are looking for a
clear-cut test that decides when two events are correlated. For both
mutual information and the Dice coefficient, this involves comparison
with a threshold that has to be determined by experimentation. While
both measures are similar in that they compare the joint probability
of the two events, p(x ∧ y), with their independent probabilities,
they have different asymptotic behaviors. For example,

•  when the two events are perfectly independent, p(x ∧ y) =
p(x) × p(y).

•  when one event is fully determined by the other (y occurs when,
and only when, x occurs), p(x ∧ y) = p(x).

In the first case, mutual information is equal to a constant and is thus
easily testable, whereas the Dice coefficient is equal to
A × f(x) × f(y) / (N × (f(x) + f(y))), where N is the total number of
sentences, and is thus a function of the individual frequencies of x
and y. In this case, the test is easier to decide when using mutual
information. In case two, the results are reversed: mutual information
is equal to −log(f(x)), up to the constant B, and thus grows with the
inverse of the individual frequency of x, whereas the Dice coefficient
is equal to a constant. Not only is the test easier to decide using the
Dice coefficient in this case, but low frequency events will also have
higher mutual information than high frequency events, a
counter-intuitive result. Since we are looking for a way to identify
correlated events, we must be able to easily identify the coefficient
when the two events are perfectly correlated, as in case two.
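
The two limiting cases can be worked out explicitly. The derivation below is a sketch that assumes the standard normalizations (A = 2, and mutual information taken as the logarithm of p(x ∧ y) / (p(x) p(y))); the symbol N, the total number of sentence pairs, is introduced here for illustration.

```latex
\begin{align*}
\text{Case 1 (independence):}\quad
  & f(x \wedge y) = \frac{f(x)\,f(y)}{N}
  \;\Rightarrow\;
  MU(x,y) = \log 1 = 0, \qquad
  Dice(x,y) = \frac{2\,f(x)\,f(y)}{N\,\bigl(f(x)+f(y)\bigr)} \\[4pt]
\text{Case 2 ($y$ occurs iff $x$ occurs):}\quad
  & f(x \wedge y) = f(x) = f(y)
  \;\Rightarrow\;
  MU(x,y) = \log\frac{N}{f(x)} = -\log f(x) + \log N, \qquad
  Dice(x,y) = \frac{2\,f(x)}{2\,f(x)} = 1
\end{align*}
```

Under these assumptions, perfect correlation always yields a Dice value of 1, whereas the corresponding mutual information value depends on how rare x is.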

Another reason that mutual information is less appropriate for our
task than the Dice coefficient is that it is, by definition, symmetric,
weighting equally one-one and zero-zero matches, while the Dice
coefficient gives more weight to one-one matches. One-one matches
are cases where both source and target words (or word groups)
appear in corresponding sentences, while in zero-zero matches, neither
source nor target words (or word groups) appear.
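
One way to see the difference is to write both measures in terms of a contingency table and vary only the number of zero-zero matches. The Python sketch below assumes the standard Dice coefficient (A = 2) and a base-2 pointwise mutual information; the function names and counts are hypothetical.

```python
import math

def dice(both, x_only, y_only):
    """Dice from contingency counts; pairs containing neither x nor y play no role."""
    return 2.0 * both / (2 * both + x_only + y_only)

def mutual_information(both, x_only, y_only, neither):
    """log2(p(x ^ y) / (p(x) p(y))) from the same contingency counts."""
    n = both + x_only + y_only + neither
    return math.log2(both * n / ((both + x_only) * (both + y_only)))

# Hypothetical counts: 30 one-one matches, 10 pairs with x only, 30 with y only.
for neither in (930, 9930):  # vary only the number of zero-zero matches
    print(neither, dice(30, 10, 30),
          round(mutual_information(30, 10, 30, neither), 3))
```

In this sketch the Dice value is identical for both corpus sizes, while the mutual information value rises as zero-zero matches are added.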

In short, we prefer the Dice coefficient because it is a better
indicator of similarity. We also confirmed the advantage of the Dice
coefficient over mutual information experimentally. In our tests with a
small sample of collocations, the Dice coefficient corrected errors
introduced by mutual information and never contradicted mutual
information when it was correct [20].

2.3.  Description  of  the  algorithm. 

For a given source collocation, Champollion produces the target
collocation by first computing the set of single words that are highly
correlated with the source collocation and then searching for any
combination of words in that set with a high correlation with the
source. To avoid computing and testing every possible combination,
which would yield a search space equal to the powerset of the set of
highly correlated individual words, Champollion iteratively searches
the set of combinations containing n words by adding one word from
the original set to each combination of (n − 1) words that has been
identified as highly correlated with the source collocation. At each
stage, Champollion throws out any combination with a low correlation,
thereby avoiding examining any supersets of that combination in a
later stage. The algorithm can be described more formally as follows:

Notation: L1 and L2 are the two languages used, and the following
symbols are used:

•  S: source collocation in L1

•  T: target collocation in L2

•  WS: list of L2 words correlated with S

•  P(WS): powerset of WS

•  n: number of elements of WS

•  CC: list of candidate target L2 collocations

•  P(i, WS): subset of P(WS) containing all the i-tuples

•  CT: correlation threshold fixed by experimentation.

Step  1:  Initialization  of  the  work  space.  Collect  all  the  words 
in  L2  that  are  correlated  with  S,  producing  WS.  At  this  point,  the 
search  space  is  P(WS);  i.e.,  T  is  an  element  of  P(WS).  Champollion 
searches  this  space  in  Step  2  in  an  iterative  manner  by  looking  at 
groups  of  words  of  increasing  cardinality. 

Step  2:  Main  iteration. 

∀ i in {1, 2, 3, ..., n}:

1.  Construct P(i, WS).

P(i, WS) is constructed by considering all the i-tuples from
P(WS) that are supersets of elements of P(i−1, WS). We define
P(0, WS) as null.

2.  Compute correlation scores for all elements of P(i, WS).
Eliminate from P(i, WS) all elements whose scores are below CT.

3.  If P(i, WS) is empty, exit the iteration loop.

4.  Add the element of P(i, WS) with the highest score to CC.

5.  Increment i and go back to the beginning of the iteration loop
(item 1).

Step 3: Determination of the best translation. Among all the
elements of CC, select as the target collocation T the element with
the highest correlation factor. When two elements of CC have the same
correlation factor, we select the one containing the largest number
of words.
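
Steps 1 to 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Champollion's actual implementation: the corpus is represented by a hypothetical inverted index mapping each target word to the set of sentence identifiers containing it, the correlation score is the Dice coefficient with A = 2, and the names `score` and `translate`, the sample data, and the threshold value are invented for the example.

```python
def dice(f_x, f_y, f_xy):
    return 2.0 * f_xy / (f_x + f_y) if (f_x + f_y) else 0.0

def score(source_ids, target_index, combo):
    """Dice between the source collocation and a set of target words."""
    target_ids = set.intersection(*(target_index[w] for w in combo))
    return dice(len(source_ids), len(target_ids), len(source_ids & target_ids))

def translate(source_ids, target_index, threshold):
    # Step 1: collect the single target words correlated with the source (WS).
    ws = [w for w in target_index
          if score(source_ids, target_index, frozenset([w])) >= threshold]
    candidates = []                        # CC
    level = [frozenset([w]) for w in ws]   # P(1, WS)
    # Step 2: score each level, prune below the threshold, then grow the
    # surviving combinations by one word taken from WS.
    while level:
        scored = [(score(source_ids, target_index, c), c) for c in level]
        survivors = [(s, c) for s, c in scored if s >= threshold]
        if not survivors:
            break
        candidates.append(max(survivors, key=lambda sc: sc[0]))
        level = list({c | {w} for _, c in survivors for w in ws if w not in c})
    # Step 3: highest score wins; ties go to the larger combination.
    if not candidates:
        return None
    return max(candidates, key=lambda sc: (sc[0], len(sc[1])))[1]

# Hypothetical data: sentences 0-3 contain the source collocation "to take steps".
source_ids = {0, 1, 2, 3}
target_index = {
    "prendre": {0, 1, 2, 3, 7},
    "mesures": {0, 1, 2, 3, 8, 9},
    "des": set(range(10)),
}
best = translate(source_ids, target_index, threshold=0.75)
print(sorted(best))  # ['mesures', 'prendre']
```

Because a combination is only extended if it survived the threshold, supersets of rejected combinations are never scored, which is the pruning that the second assumption of Section 2 licenses.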

Step 4: Determination of word ordering. Once the translation has
been selected, Champollion examines all the sentences containing the
selected translation in order to determine the type of the collocation,
i.e., whether the collocation is flexible (word order is not fixed) or
rigid. This is done by determining whether the words are used in the
same order, and at the same distance from one another, in the majority
of these sentences. When the collocation is rigid, the word order is
also produced. Note that although this is done as a postprocessing
stage, it does not require rereading the corpus since the information
needed has already been precomputed.
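
The ordering test of Step 4 can be sketched as follows. This Python fragment is an assumption-laden illustration: sentences are given as token lists, only the first occurrence of each word is considered, and the function name `classify_order` and the strict-majority cutoff of 0.5 are choices made for the example rather than details given in the paper.

```python
from collections import Counter

def classify_order(words, sentences, majority=0.5):
    """Return ("rigid", ordered_words) or ("flexible", None)."""
    patterns = Counter()
    for tokens in sentences:
        if not all(w in tokens for w in words):
            continue
        # Position of the first occurrence of each word in this sentence.
        positions = sorted((tokens.index(w), w) for w in words)
        order = tuple(w for _, w in positions)
        gaps = tuple(b[0] - a[0] for a, b in zip(positions, positions[1:]))
        patterns[(order, gaps)] += 1
    total = sum(patterns.values())
    if total == 0:
        return ("flexible", None)   # no evidence in the corpus
    (order, _), count = patterns.most_common(1)[0]
    if count / total > majority:
        return ("rigid", list(order))
    return ("flexible", None)

# Hypothetical tokenized sentences.
rigid_sents = [
    ["la", "convention", "collective", "expire"],
    ["une", "convention", "collective"],
    ["cette", "convention", "collective", "existe"],
]
flex_sents = [
    ["prendre", "des", "mesures"],
    ["prendre", "toutes", "les", "mesures"],
    ["mesures", "pour", "prendre"],
]
print(classify_order(["collective", "convention"], rigid_sents))
print(classify_order(["prendre", "mesures"], flex_sents))
```

In the first call every sentence shows the same order and spacing, so the collocation is labeled rigid and the order is returned; in the second no single pattern reaches a majority, so it is labeled flexible.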

Example output of Champollion is given in Table 1. Flexible
collocations are shown with a “...” indicating where additional, variable
words could appear. These examples show cases where a two-word
collocation is translated as one word (e.g., “health insurance”), a
two-word collocation is translated as three words (e.g., “employment
equity”), and words are inverted in the translation (e.g.,
“advance notice”).

3.  Evaluation 

We are carrying out three tests with Champollion, using two database
corpora and three sets of source collocations. The first database
corpus (DB1) consists of 8 months of Hansards aligned data taken
from 1986, and the second database corpus (DB2) consists of all of the
1986 and 1987 transcripts of the Canadian Parliament. The first set
of source collocations (C1) consists of 300 collocations identified by
Xtract on all data from 1986, the second set (C2) consists of 300
collocations identified by Xtract on all data from 1987, and the third
set (C3) consists of 300 collocations identified by Xtract on all
data from 1988. We used DB1 with both C1 (experiment 1) and C2
(experiment 2) and are currently using DB2 on C3 (experiment 3).
Results from the third experiment were not yet available at time of
publication.

Experiment    OK    X     W     Overall
C1/DB1        70    11    19    77
C2/DB1        58    11    31    61

Table 2: Evaluation results for Champollion.

We asked three bilingual speakers to evaluate the results for the
different experiments; the results are shown in Table 2. The second
column gives the percentage of correct translations, the third
column gives the percentage of Xtract errors, the fourth column gives
the percentage of Champollion’s errors, and the last column gives
the percentage of Champollion’s correct translations if the input is
filtered of errors introduced by Xtract. Averages of the three
evaluators’ scores are shown; the scores of individual evaluators
were within 1-2% of each other, indicating high agreement between
judges. The best results are obtained when the database corpus is
also used as a training corpus for Xtract; ignoring Xtract errors,
the evaluation is as high as 77%. The second experiment produces
lower results because many input collocations did not appear often
enough in the database corpus. We hope to show that we can
compensate for this by increasing the corpus size in the third
experiment.

One class of Champollion’s errors arises because it does not
translate closed-class words such as prepositions. Since the frequency
of prepositions is so high in comparison to open-class words, including
them in the translations throws off the correlation measures.
Translations that should have included prepositions were judged
inaccurate by our evaluators, and this accounted for approximately 5%
of the errors. This is an obvious place to begin improving the accuracy
of Champollion.
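
One way to see why a very frequent word scores poorly on its own is to bound its Dice coefficient. The worked example below assumes the standard Dice form with A = 2, and the counts are hypothetical, chosen only to illustrate the orders of magnitude involved.

```latex
\begin{align*}
Dice(x, y) &= \frac{2\,f(x \wedge y)}{f(x) + f(y)}
  \;\le\; \frac{2\,f(x)}{f(x) + f(y)}
  && \text{since } f(x \wedge y) \le f(x) \\[4pt]
\text{e.g. } f(x) = 50,\; f(y) = 20000: \quad
Dice(x, y) &\le \frac{100}{20050} \approx 0.005
\end{align*}
```

Even if the preposition y appeared in every sentence containing the source collocation x, its score would stay far below any useful threshold, so it would not be retained as a candidate word.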

4.  Related  Work. 

The recent availability of large amounts of bilingual data has attracted
interest in several areas, including sentence alignment [10], [2], [11],
[1], [4], word alignment [6], alignment of groups of words [3], [7],
and statistical translation [8]. Of these, aligning groups of words
is most similar to the work reported here, although we consider
a greater variety of groups. Note that additional research using
bilingual corpora is less related to ours, addressing, for example,
word sense disambiguation in the source language by examining
different translations in the target [9], [8].

One line of research uses statistical techniques only for machine
translation [8]. Brown et al. use a stochastic language model
based on the techniques used in speech recognition [19], combined
with translation probabilities compiled on the aligned corpus, in
order to do sentence translation. The project produces high quality
translations for shorter sentences (see Berger et al., this volume,
for information on most recent results) using little linguistic and no
semantic information. While they also align groups of words across
languages in the process of translation, they are careful to point out
that such groups may or may not occur at constituent breaks in the
sentence. In contrast, our work aims at identifying syntactically and
semantically meaningful units, which may either be constituents or
flexible word pairs separated by intervening words, and provides the
translation of these units for use in a variety of bilingual applications.
Thus, the goals of our research are somewhat different.

English                       French Equivalent
advance notice                prévenu avance
additional cost               coûts supplémentaires
apartheid ... South Africa    apartheid ... afrique sud
affirmative action            action positive
collective agreement          convention collective
free trade                    libre-échange
freer trade                   libéralisation ... échanges
head office                   siège social
health insurance              assurance-maladie
employment equity             équité ... matière ... emploi
make a decision               prendre ... décisions
to take steps                 prendre ... mesures
to demonstrate support        prouver ... adhésion

Table 1: Some translations produced by Champollion.

Kupiec [3] describes a technique for finding noun phrase
correspondences in bilingual corpora. First, as for Champollion, the
bilingual corpus must be aligned sentence-wise. Then, each corpus
is run separately through a part-of-speech tagger and noun phrase
recognizer. Finally, noun phrases are mapped to each other using
an iterative reestimation algorithm. In addition to the limitations
indicated in [3], this technique only handles NPs, whereas collocations
have been shown to include parts of NPs, categories other than NPs
(e.g., verb phrases), as well as flexible phrases that do not fall into a
single category but involve words separated by an arbitrary number of
other words, such as “to take ... steps,” “to demonstrate ... support,”
etc. In this work, as in earlier work [7], we address this full range of
collocations.

5.  Conclusion 

We have presented a method for translating collocations,
implemented in Champollion. The ability to compile a set of translations
for a new domain automatically will ultimately increase the
portability of machine translation systems. The output of our system is
a bilingual lexicon that is directly applicable to machine translation
systems that use a transfer approach, since they rely on
correspondences between words and phrases of the source and target
languages. For interlingua systems, translating collocations can aid in
augmenting the interlingua; since such phrases cannot be translated
compositionally, they indicate where concepts representing such
phrases must be added to the interlingua.

Since Champollion makes few assumptions about its input, it can be
used for many pairs of languages with little modification.
Champollion can also be applied to many application domains since
it incorporates no assumptions about the domain. Thus, we can
obtain domain-specific bilingual collocation dictionaries by applying
Champollion to different domain-specific corpora. Since collocations
and idiomatic phrases are clearly domain dependent, the facility to
quickly construct the phrases used in new domains is important. A
tool such as Champollion is useful for many tasks including machine
(aided) translation, lexicography, language generation, and
multilingual information retrieval.